First Steps with Stan

Maud goes to Vegas

Andrew B. Collier

15 May 2018

<!-- reveal_plugins: ["notes"] -->

Maud’s Burning Questions

  • What is the hit rate?
  • What is the RTP (return to player)?
  • What is the hit rate for each winning combination?

Maud’s Data (by Session)

sessions %>% select(-details)
# A tibble: 100 x 7
   session spins  hits wager payout  hit_rate       rtp
     <int> <int> <int> <int>  <dbl>     <dbl>     <dbl>
 1       1     7     2    10      3 0.2857143 0.3000000
 2       2    19     7    30     29 0.3684211 0.9666667
 3       3    19     3    22      3 0.1578947 0.1363636
 4       4    26     7    30     13 0.2692308 0.4333333
 5       5    23     8    31     35 0.3478261 1.1290323
 6       6    20     8    26     12 0.4000000 0.4615385
 7       7    22     5    30     20 0.2272727 0.6666667
 8       8    22     4    25     10 0.1818182 0.4000000
 9       9    18     4    26      6 0.2222222 0.2307692
10      10    26     8    33     75 0.3076923 2.2727273
# ... with 90 more rows

Maud’s Data (by Spin)

sessions %>% select(session, details) %>% unnest()
# A tibble: 1,972 x 4
   session  spin wager payout
     <int> <int> <int>  <dbl>
 1       1     1     1      1
 2       1     2     1      0
 3       1     3     1      0
 4       1     4     3      0
 5       1     5     2      0
 6       1     6     1      2
 7       1     7     1      0
 8       2     1     2      0
 9       2     2     2      0
10       2     3     1      1
# ... with 1,962 more rows

Binomial Process

Probability distribution for a binomial process:

\[ P(k | n, \theta) = \binom{n}{k} \theta^k (1 - \theta)^{n - k} \]

where

  • \(k\) is the number of successes in \(n\) trials; and
  • \(\theta\) is the probability of success on any trial.

\[ \text{hit rate} = \theta^* = \frac{\text{hits}}{\text{spins}} \]

with(sessions, sum(hits) / sum(spins))
[1] 0.3128803

Averaging the hit rate across sessions:

# A tibble: 1 x 2
  session_avg session_std
        <dbl>       <dbl>
1   0.3100018   0.1030164

A 95% confidence interval extends from 28.9% to 33.1%.
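The interval above can be reproduced from the session summary with a normal approximation (a sketch; the 1.96 multiplier and the use of per-session means are assumptions about how the interval was computed):

```r
# Normal-approximation 95% confidence interval for the mean per-session
# hit rate, using the summary statistics shown above.
session_avg <- 0.3100018
session_std <- 0.1030164
n_sessions  <- 100

ci <- session_avg + c(-1.96, 1.96) * session_std / sqrt(n_sessions)
round(ci, 3)
```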

For session \(i\) there were \(k_i\) hits from \(n_i\) spins.

Assuming that sessions are independent:

\[ P(k | n, \theta) = \prod_i P(k_i | n_i, \theta) = L(\theta; n, k) \]

Log-likelihood for a binomial process (multiple sessions), dropping the constant binomial coefficient term:

\[ LL(\theta; n, k) = \sum_i k_i \log{\theta} + (n_i - k_i) \log{(1 - \theta)} \]

# theta - probability [single value]
# k - number of successes [vector]
# n - number of trials [vector]
#
binom_log_likelihood <- function(theta, k, n) {
  sum(k * log(theta) + (n - k) * log(1 - theta))
}

parameters %>% arrange(desc(log_likelihood)) %>% head()
# A tibble: 6 x 2
  theta log_likelihood
  <dbl>          <dbl>
1  0.32      -1225.604
2  0.30      -1226.146
3  0.34      -1228.649
4  0.28      -1230.543
5  0.36      -1235.078
6  0.26      -1239.142
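The grid above can be generated by evaluating binom_log_likelihood over candidate values of \(\theta\). A minimal sketch (the per-session counts here are hypothetical stand-ins for Maud's data, and the function is repeated so the sketch is self-contained):

```r
binom_log_likelihood <- function(theta, k, n) {
  sum(k * log(theta) + (n - k) * log(1 - theta))
}

# Hypothetical per-session counts (stand-ins for sessions$hits and sessions$spins).
k <- c(2, 7, 3, 7)
n <- c(7, 19, 19, 26)

# Evaluate the log-likelihood over a grid of candidate theta values.
theta_grid <- seq(0.02, 0.98, by = 0.02)
ll <- sapply(theta_grid, binom_log_likelihood, k = k, n = n)

# The maximum should sit close to the analytical MLE, sum(k) / sum(n).
theta_grid[which.max(ll)]
```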

Bayes

Bayesian Inference

Bayes’ Theorem \[ p(\theta|y, X) = \frac{p(y|X, \theta) \, p(\theta)}{p(y|X)} \propto p(y|X, \theta) \, p(\theta) \] where

  • \(y\) are observations;
  • \(X\) are predictors;
  • prior — \(p(\theta)\) is parameter distribution before data;
  • likelihood — \(p(y|X, \theta)\) is probability of data given parameters; and
  • posterior — \(p(\theta|y, X)\) is parameter distribution given data.
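For the binomial model with a uniform prior the posterior is available in closed form (a Beta distribution), which later gives a handy check on the MCMC output. A sketch, assuming pooled totals of 617 hits in 1972 spins (consistent with the overall hit rate above):

```r
# Conjugate update: with a Beta(1, 1) (uniform) prior, observing k hits
# in n spins gives a Beta(1 + k, 1 + n - k) posterior for theta.
k <- 617
n <- 1972

posterior_mean <- (1 + k) / (2 + n)
posterior_mean              # Very close to the MLE, k / n
```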

“… the theory of inverse probability is founded upon error and must be wholly rejected.” (Sir Ronald Fisher, 1925)

Monte Carlo

Vanilla Monte Carlo

Generate independent samples \(\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(M)}\).

To do this we need to be able to sample directly from \(p(\theta | y, X)\).

Markov Chain Monte Carlo (MCMC)

Generate a sequence of samples \(\theta^{(1)}\), \(\theta^{(2)}\), \(\theta^{(3)}\), … \(\theta^{(M)}\) where \(\theta^{(m)}\) depends on \(\theta^{(m-1)}\).

To do this we need \(p(\theta^{(m)} | \theta^{(m-1)}, y, X)\).

Metropolis-Hastings Algorithm

Randomly sample \(\theta^{(0)}\).

Set \(i = 1\).

  1. Randomly sample a proposal \(\theta^{*}\) in the vicinity of \(\theta^{(i-1)}\).
  2. Sample \(u\) uniformly on \([0, 1)\).
  3. Accept or reject the proposal:

\[ \theta^{(i)} = \left\{\begin{array}{ll} \theta^{*} & \text{if } u \cdot p(\theta^{(i-1)}|y, X) < p(\theta^{*}|y, X) \\ \theta^{(i-1)} & \text{otherwise.} \end{array}\right. \]

  4. Increment \(i\) and return to step 1.
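The loop above can be sketched in a few lines of R for the binomial hit-rate model (the data, proposal scale and chain length here are all hypothetical choices):

```r
# Metropolis-Hastings for the binomial hit-rate model.
# Hypothetical per-session counts stand in for Maud's data.
hits  <- c(2, 7, 3)
spins <- c(7, 19, 19)

# Log-posterior with a uniform(0, 1) prior on theta.
log_posterior <- function(theta) {
  if (theta <= 0 || theta >= 1) return(-Inf)
  sum(dbinom(hits, spins, theta, log = TRUE))
}

set.seed(13)
M <- 5000
theta <- numeric(M)
theta[1] <- runif(1)

for (i in 2:M) {
  # Propose in the vicinity of the current value.
  proposal <- theta[i - 1] + rnorm(1, sd = 0.05)
  # Accept if u * p(current) < p(proposal), compared on the log scale.
  if (log(runif(1)) < log_posterior(proposal) - log_posterior(theta[i - 1])) {
    theta[i] <- proposal
  } else {
    theta[i] <- theta[i - 1]
  }
}

mean(theta[-(1:1000)])      # Posterior mean after discarding warmup
```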

Stan Workflow

  1. Data story (informs the model).
  2. Write Stan program (likelihood and priors).
  3. Stan parser converts this to C++.
  4. Compile the C++ code.
  5. Execute the compiled binary.
  6. Evaluate the results. (Possibly return to step 2.)
  7. Inference based on the posterior sample.

To use rstan you need a .stan file (the model) and an .R script (to prepare the data and run the model).

Stan Skeleton

data {
  int<lower=0> N;
  int hits[N];
  int spins[N];
}
parameters {
  real<lower=0, upper=1> theta;        // Bounds keep the sampler in (0, 1)
}
model {
  hits ~ binomial(spins, theta);       // Likelihood
  theta ~ uniform(0, 1);               // Prior (redundant given the bounds)
}

library(rstan)

# Use all available cores.
#
options(mc.cores = parallel::detectCores())

trials <- list(
  N       = nrow(sessions),
  hits    = sessions$hits,
  spins   = sessions$spins
)

fit <- stan(
  file    = "binomial-no-prior.stan",
  data    = trials,
  chains  = 2,                         # Number of Markov chains
  warmup  = 500,                       # Warmup iterations per chain
  iter    = 1000                       # Total iterations per chain
)

class(fit)
[1] "stanfit"
attr(,"package")
[1] "rstan"
samples <- as.data.frame(fit)

head(samples)
      theta      lp__
1 0.2973639 -1228.064
2 0.3302369 -1228.237
3 0.2955706 -1228.346
4 0.2978613 -1227.991
5 0.3017639 -1227.505
6 0.3228558 -1227.345
nrow(samples)
[1] 1000


print(fit, probs = c(0.025, 0.5, 0.975))
Inference for Stan model: binomial-no-prior.
2 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=1000.

          mean se_mean   sd     2.5%      50%    97.5% n_eff Rhat
theta     0.31    0.00 0.01     0.30     0.31     0.33   386 1.01
lp__  -1227.38    0.03 0.61 -1229.18 -1227.14 -1226.91   465 1.01

Samples were drawn using NUTS(diag_e) at Fri May 11 03:42:39 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Use summary() to get information for each chain.

Binomial Model with a Prior

data {
  int<lower=0> N;
  int hits[N];
  int spins[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  hits ~ binomial(spins, theta);       // Likelihood
  theta ~ beta(2, 2);                  // Prior
}

Normal Model for RTP

data {
  int<lower=0> N;
  real rtp[N];
}
parameters {
  real<lower=0, upper=1> mu;           // Beta prior requires mu in (0, 1)
  real<lower=0> sigma;
}
model {
  rtp ~ normal(mu, sigma);
  mu ~ beta(2, 2);                     // Mode = 0.5
}

Inference for Stan model: normal.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

       mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu     0.77    0.00 0.06  0.65  0.73  0.77  0.81  0.88  2719    1
sigma  0.64    0.00 0.04  0.56  0.61  0.64  0.67  0.74  3265    1
lp__  -7.25    0.02 0.94 -9.68 -7.63 -6.96 -6.58 -6.32  1901    1

Samples were drawn using NUTS(diag_e) at Fri May 11 04:26:55 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

The average RTP is around 77%.

Pay Table: Cat and Mouse

description   symbols   payout
                             0
1x mouse      🐭              1
1x cat        🐱              2
2x mouse      🐭🐭             5
2x cat        🐱🐱            10
3x mouse      🐭🐭🐭           20
3x cat        🐱🐱🐱          100

data {
  int<lower=0> N;
  real rtp[N];
}
parameters {
  real<lower=0, upper=1> mu1;          // Payline 1 -> 1x
  real<lower=0, upper=1> mu2;          // Payline 2 -> 2x
  real<lower=0, upper=1> mu3;          // Payline 3 -> 5x
  real<lower=0, upper=1> mu4;          // Payline 4 -> 10x
  real<lower=0, upper=1> mu5;          // Payline 5 -> 20x
  real<lower=0, upper=1> mu6;          // Payline 6 -> 100x
  real<lower=0> sigma;
}
transformed parameters {
  real<lower=0> pay;
  pay = mu1 * 1 + mu2 * 2 + mu3 * 5 + mu4 * 10 + mu5 * 20 + mu6 * 100;
}
model {
  rtp ~ lognormal(log(pay) - sigma * sigma / 2, sigma);
  //
  // The mean of lognormal() is exp(mu + sigma * sigma / 2).
  //
  mu1 ~ beta(2, 2);                    // Mode = 0.5
  mu2 ~ beta(2, 4);                    // Mode = 0.25
  mu3 ~ beta(2, 5);                    // Mode = 0.2
  mu4 ~ beta(2, 8);                    // Mode = 0.125
  mu5 ~ beta(2, 10);                   // Mode = 0.1
  mu6 ~ beta(2, 16);                   // Mode = 0.0625
}

summary(fit)$summary
               mean      se_mean          sd          2.5%           25%           50%           75%         97.5%
mu1     0.140183104 1.469255e-03 0.089325826  1.879710e-02  7.177827e-02   0.122664752   0.192136367   0.355484201
mu2     0.068228582 6.855675e-04 0.043359095  1.009871e-02  3.522552e-02   0.059594974   0.092294699   0.172075969
mu3     0.028816792 3.014155e-04 0.019063192  3.628413e-03  1.423612e-02   0.025236838   0.039359326   0.074382807
mu4     0.014604142 1.465113e-04 0.009266188  2.047886e-03  7.471645e-03   0.012931411   0.019858629   0.036488600
mu5     0.007310498 7.466952e-05 0.004625688  1.032144e-03  3.817441e-03   0.006498005   0.009835081   0.018626924
mu6     0.001500567 1.496690e-05 0.000946590  1.915782e-04  7.711689e-04   0.001324914   0.002041584   0.003783854
sigma   0.740078933 1.058005e-03 0.057922464  6.352298e-01  7.001666e-01   0.737313210   0.776424743   0.860629274
pay     0.863032278 1.374708e-03 0.079680070  7.286920e-01  8.064539e-01   0.855920287   0.909859834   1.038541926
lp__  -67.997071371 4.994067e-02 2.062849762 -7.287027e+01 -6.912510e+01 -67.655330546 -66.467193435 -65.058522027
         n_eff      Rhat
mu1   3696.236 0.9991526
mu2   4000.000 0.9995266
mu3   4000.000 0.9995504
mu4   4000.000 1.0004483
mu5   3837.655 1.0004658
mu6   4000.000 1.0003130
sigma 2997.221 1.0006985
pay   3359.532 0.9997769
lp__  1706.186 1.0028480
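Since pay is a linear function of the payline hit rates, its posterior mean can be checked directly from the posterior means above:

```r
# Posterior means of mu1..mu6 (from the summary above) and the pay table payouts.
mu     <- c(0.140183104, 0.068228582, 0.028816792,
            0.014604142, 0.007310498, 0.001500567)
payout <- c(1, 2, 5, 10, 20, 100)

sum(mu * payout)            # Matches the posterior mean of pay above
```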

The overall RTP (the posterior mean of pay) is around 86%.

Predictive Inference

Posterior Predictive Distribution

\[ p(\tilde{y}|y) = \int_\theta p(\tilde{y}|\theta) p(\theta|y) d\theta \]
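In practice this integral is approximated by simulating new data from each posterior draw. A sketch for the binomial model (the Beta draws stand in for samples$theta from the earlier fit; the 20-spin session size is arbitrary):

```r
set.seed(1)

# Stand-in for posterior draws of theta: Beta(618, 1356) is the conjugate
# posterior for 617 hits in 1972 spins under a uniform prior.
theta_draws <- rbeta(1000, 618, 1356)

# Posterior predictive distribution of hits in a new 20-spin session.
y_tilde <- rbinom(length(theta_draws), size = 20, prob = theta_draws)

quantile(y_tilde, c(0.025, 0.5, 0.975))
```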

What is Bayes Good For?

  • Models where we care about parameter values.
  • Models where we want to quantify uncertainty in parameter values.
  • Generating derived quantities from the posterior (because we can evaluate any function of the parameters).

Resources

Extra Stuff

Poisson

# trials <- list(
#   N       = nrow(sessions),
#   spins   = sessions$spins
# )
# 
# fit <- stan(
#   file    = "poisson.stan",
#   data    = trials,
#   chains  = 4,
#   warmup  = 1000,
#   iter    = 2000
# )
# extract(fit)

# hist(extract(fit)$lambda)

Regression

# ggplot(sessions, aes(spins, wager)) + geom_jitter()

Regression Stan

data {
  int<lower=0> N;
  vector[N] y;                         // Total wager
  vector[N] x;                         // Number of spins (standardised)
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma); // Likelihood
  //
  alpha ~ normal(0, 10);               // Prior (Normal)
  beta ~ normal(0, 10);                // Prior (Normal)
  sigma ~ cauchy(0, 5);                // Prior (Cauchy)
}
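Since x is described as standardised, the data list might be assembled like this (a sketch; the values stand in for sessions$spins and sessions$wager):

```r
# Standardise the predictor and assemble the data list for the regression model.
spins <- c(7, 19, 19, 26, 23)            # Stand-in for sessions$spins
wager <- c(10, 30, 22, 30, 31)           # Stand-in for sessions$wager

x <- as.vector(scale(spins))             # Mean 0, standard deviation 1

regression_data <- list(
  N = length(spins),
  x = x,
  y = wager
)
```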

Regression Stan R

# summary(fit)

ShinyStan

library(shinystan)

launch_shinystan(fit)

Diagnose -> PPcheck (“Posterior predictive check”): compare the distribution of the observed data to the replications. Do they look similar? “If our model is a good fit then we should be able to use it to generate data that looks a lot like the data we observed.”